
Shadowserver dynamic config #2372

Merged

aaronkaplan merged 75 commits into develop from shadowserver-dynamic-config on Dec 18, 2023

Conversation

@elsif2 (Collaborator) commented May 28, 2023

Dynamic configuration

Many thanks to Eric Halil for collaboration and testing on this update.

Parsers

  • intelmq.bots.parsers.shadowserver._config:
    • Switch to dynamic configuration to decouple report schema changes from IntelMQ releases. Please see intelmq/bots/parsers/shadowserver/README.md for details. (A brief sketch of the idea is included at the end of this description.)

Collectors

  • intelmq.bots.collectors.shadowserver.collector_reports_api:
    • The 'json' option is no longer supported as the 'csv' option provides better performance. Please see intelmq/bots/parsers/shadowserver/README.md for a sample configuration.
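
To make the parser change above concrete, here is a minimal, hypothetical sketch and not the actual intelmq.bots.parsers.shadowserver._config code: report mappings are read from a schema.json file at runtime, so picking up new or changed report definitions only requires updating that file. The schema.json layout assumed here (per-report entries with feed_name/file_name keys and underscore-prefixed metadata keys) is an illustration, not the documented format.

# Hypothetical sketch only; see intelmq/bots/parsers/shadowserver/README.md for the real details.
import json
import os

SCHEMA_FILE = os.path.join(os.path.dirname(__file__), 'schema.json')

feedname_mapping = {}   # feed name -> report mapping definition
filename_mapping = {}   # file name -> (report mapping definition, feed name)


def reload():
    # Re-read schema.json so new or changed report definitions take effect
    # without a new IntelMQ release.
    feedname_mapping.clear()
    filename_mapping.clear()
    with open(SCHEMA_FILE, encoding='utf-8') as handle:
        schema = json.load(handle)
    for report_name, definition in schema.items():
        if report_name.startswith('_'):  # assumed: reserved metadata keys such as a version marker
            continue
        feedname_mapping[definition.get('feed_name', report_name)] = definition
        filename_mapping[definition.get('file_name', report_name)] = (definition, report_name)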

@sebix (Member) commented May 30, 2023

@elsif2 the license check is failing

@elsif2 (Collaborator, Author) commented May 30, 2023

From the log:

The following files have no copyright and licensing information:
* intelmq/bots/parsers/shadowserver/schema.json.test

License added.

@sebix sebix added this to the 3.3.0 milestone Jul 19, 2023
@kamil-certat kamil-certat self-requested a review July 24, 2023 10:02
@aaronkaplan aaronkaplan self-requested a review July 27, 2023 10:04
@aaronkaplan aaronkaplan self-assigned this Jul 27, 2023
@kamil-certat (Contributor) left a comment

Sorry for the long wait for the review. I have noted a few things that are important to me, mainly from the perspective of an IntelMQ user. I have marked with 🔴 the things that are most critical for me.

Generally, I like separating the schema from IntelMQ releases. Unfortunately, I have doubts about some of the decisions, as I'd like to better control when and whether the update happens. I'd also recommend taking a look at how other bots handle this topic, e.g. the ASN lookup expert: https://github.com/certtools/intelmq/tree/bcccbf501ffa8b64c2ea011d457c3409ad1dd4cd/intelmq/bots/experts/asn_lookup

I have also identified a few points where the code can break, e.g. in our environment.

In addition, I'd recommend adding the regular update to the crontab installed with the native packages: https://github.com/certtools/intelmq/tree/develop/contrib/cron-jobs

If you have any questions or need any help, you can reach me on the Shadowserver MM or by email (but unfortunately I am going on vacation and will be back on 21.08).

Resolved review threads: intelmq/bots/parsers/shadowserver/parser.py (two threads), intelmq/bots/parsers/shadowserver/README.md, intelmq/bots/parsers/shadowserver/_config.py (two threads), intelmq/bots/parsers/shadowserver/update_schema.py
@kamil-certat (Contributor)

In addition, I'd like to ask if you can publish a policy on changes to the schema; the most important point for me is in which cases the classification config (e.g. identifier, taxonomy) can change.

@sebix (Member) commented Jul 27, 2023

I haven't seen a discussion of this architectural change or an IEP yet. Or was it discussed on a non-public list?

@elsif2 elsif2 force-pushed the shadowserver-dynamic-config branch from 77ac732 to 3d6a87e on July 27, 2023 20:31
@elsif2 (Collaborator, Author) commented Jul 27, 2023

In addition, I'd like to ask if you can publish a policy on changes to the schema; the most important point for me is in which cases the classification config (e.g. identifier, taxonomy) can change.

I started a Schema contract section in bots/parsers/shadowserver/README.md to address these concerns.

@aaronkaplan (Member)

I haven't seen a discussion of this architectural change or an IEP yet. Or was it discussed on a non-public list?

@sebix I had a good chat with elsif2 today and we agreed that:

  • I try out the changes and get them to work for myself
  • I document things as I go
  • I cross-check with elsif2 as I go, and we change things iteratively
  • we both start to write down a sort of "contract" / documentation on this. It's paramount for intelmq users and admins to know what the "contract" is: which fields may change dynamically, whether added fields will stay forever, whether only new fields get added, how to deal with new fields, how often updates are needed, what happens when someone does not update, etc. If we write this down and make it clear, it is the basis for discussions and it also helps to improve the concept.

So, TL;DR: don't worry :) we are addressing the concerns.

@aaronkaplan (Member) left a comment

I share a lot of concerns with @kamil-certat here, but my main concern is: is there a way to be as little disruptive as possible? Or let me reformulate: let's think hard about how we can achieve this.

Because most of the time IntelMQ runs unattended, automatically, at night.
People don't notice in the logs if something dynamic (like a config update) happened.
Nor will they debug it.

We agreed in a chat today that

  • I would try out the change
  • We iterate over it and improve it
  • We document it
  • We document a type of "contract" between users/-admins and the dynamic update service
  • We need to have an opt-in/opt-out process for this (a sketch of one possible hook follows after this comment).

I want to reiterate that I think it is very, very beneficial if IntelMQ users can easily get the latest and best Shadowserver parsers. We just need to make sure it gets done right.
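
One concrete opt-in/opt-out hook already appears in the README excerpt reviewed further below: the parser skips the startup schema download when INTELMQ_SKIP_INTERNET is set. A minimal sketch of such a check follows; the function name and log message are assumptions, not code from this PR.

import logging
import os

logger = logging.getLogger(__name__)


def maybe_update_schema() -> bool:
    # Respect the operator's opt-out before attempting any download.
    if os.environ.get('INTELMQ_SKIP_INTERNET'):
        logger.info('INTELMQ_SKIP_INTERNET is set; keeping the locally installed schema.json.')
        return False
    # ... otherwise download and activate the latest schema revision here
    return True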

CHANGELOG.md Outdated

#### Parsers
- `intelmq.bots.parsers.shadowserver._config`:
- Reset detected `feedname` at shutdown to re-detect the feedname on reloads (PR#2361 by @elsif2, fixes #2360).
- Switch to dynamic configuration to decouple report schema changes from IntelMQ releases. Please see intelmq/bots/parsers/shadowserver/README.md for details. (PR#2372)
Member:

I think we should also add a section in the official IntelMQ docs for the Shadowserver feeds, since they are so crucial.

Member:

The README is not the right place to document the bot for users.
User documentation is in docs/user/bots.rst (https://intelmq.readthedocs.io/en/latest/user/bots.html#shadowserver) and intelmq/etc/feeds.yml (https://intelmq.readthedocs.io/en/latest/user/feeds.html). For upgrade steps, use NEWS.md.
Please move the information from the README there.

For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision.
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set.

For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory.
Member:

Also, what happens when people don't update in air-gapped environments? Is the parsing fail-safe then? Like how HTML ignores fields and element attributes it does not understand?

'classification.identifier': 'spam-url',
},
}
functions = {
Member:

I won't check this huge diff, but I assume it's valid, since the automatically generated content gets removed and moved to the dynamic config generation.

Collaborator (Author):

The behavior is unchanged from the existing release. Errors are logged for new report types. Fields added to existing reports are ignored.

}


def reload():
Member:

  1. And have a mechanism to test the new config automatically. If it fails, the admin needs to know. And the developers need to know.
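
A hypothetical sketch of such a "test before activating" step: download the schema to a temporary file, sanity-check it, and only then replace the live schema.json, so a broken download never becomes the active configuration. The URL is a placeholder and the checks are deliberately minimal; none of this is code from the PR.

import json
import os
import tempfile
import urllib.request

SCHEMA_URL = 'https://example.org/shadowserver-schema.json'  # placeholder, not the real endpoint
SCHEMA_FILE = 'schema.json'


def update_schema():
    with urllib.request.urlopen(SCHEMA_URL, timeout=30) as response:
        data = response.read()
    schema = json.loads(data)  # raises on invalid JSON, so the old file stays active
    if not isinstance(schema, dict) or not schema:
        raise ValueError('Downloaded schema is empty or has an unexpected structure')
    # Write atomically so a half-written file can never be picked up by the parser.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(SCHEMA_FILE)))
    with os.fdopen(fd, 'wb') as handle:
        handle.write(data)
    os.replace(tmp_path, SCHEMA_FILE)

Raising on a failed check, instead of silently continuing, is what lets a cron job or the bot's log surface the problem to the admin.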

For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision.
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set.

For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory.
Member:

Agree with sebix here: this would be a good initial step which helps it to work implicitly.

destination_queues:
_default: [shadowserver-parser-queue]
file_format: csv
api_key: "$API_KEY_received_from_the_shadowserver_foundation"
Member:

Is it also possible to give a specific query (as in the Shadowserver API) here? Like asn:12345?

Collaborator (Author):

Is it also possible to give a specific query (as in the Shadowserver API) here? Like asn:12345?

No. Some organizations do have separate report sets that can be handled independently using the reports option.

@sebix (Member) commented Aug 2, 2023

I haven't seen a discussion of this architectural change or an IEP yet. Or was it discussed on a non-public list?

* we both start to write down a sort of "contract" / documentation on this. It's paramount for intelmq users and admins to know what the "contract" is: which fields may change dynamically, whether added fields will stay forever, whether only new fields get added, how to deal with new fields, how often updates are needed, what happens when someone does not update, etc. If we write this down and make it clear, it is the basis for discussions and it also helps to improve the concept.

This is a good step forward!

So, TL;DR: don't worry :) we are addressing the concerns.

I'm not worried personally, as I'm not directly affected, but many users are. I want to contribute my experience to get the change into operation smoothly.

Needless to say, the proposed approach has its advantages.

Here are some thoughts:

  • The delegation of the mapping to a third party outside of the IntelMQ software also deprives the IntelMQ developers & users of the possibility of reviewing the field mappings before they become active.
  • Until now, only collectors fetch the Shadowserver data via API or from local sources like IMAP, file, ticket system, etc. IntelMQ doesn't need to be connected to the internet to parse Shadowserver data. With this change merged, IntelMQ gains an additional requirement on the operation. Operators need to consider that and eventually adapt their installations or environments.
  • Another issue is that it will take more work to parse older or historical data. Do you remember that a few years ago, we parsed Shadowserver data covering a few years of data for statistical purposes? We had the older Shadowserver feed mappings at hand, thus no issues parsing older data.

@aaronkaplan (Member)

Let me try to address some topics mentioned.

  • The delegation of the mapping to a third party outside of the IntelMQ software also deprives the IntelMQ developers & users of the possibility of reviewing the field mappings before they become active.

Nope, since you can always fetch the newest mappings manually and review them. So I don't see a problem here.
But yes, we could and should try out the "air-gapped" mode and see if it's actually feasible before merging. Agreed on that.

  • Until now, only collectors fetch the Shadowserver data via API or from local sources like IMAP, file, ticket system, etc. IntelMQ doesn't need to be connected to the internet to parse Shadowserver data. With this change merged, IntelMQ gains an additional requirement on the operation. Operators need to consider that and eventually adapt their installations or environments.

Again, see above: it will also work in air-gapped mode, and pulling the latest mappings will be similar to doing an IntelMQ update manually.

  • Another issue is that it will take more work to parse older or historical data. Do you remember that a few years ago, we parsed Shadowserver data covering a few years of data for statistical purposes? We had the older Shadowserver feed mappings at hand, thus no issues parsing older data.

This is a very relevant point. I think it needs to be in the contract that historic fields don't get remapped, changed, or deleted.

@kamil-certat (Contributor)

Thanks for the changes, @elsif2! I see most of my technical concerns have been addressed :)

About the general discussion:

re: delegation of schema outside IMQ

I see this as a trade-off problem, but I think Shadowserver has a very good reason to delegate it outside, as their reports are developing much more quickly than IntelMQ. In addition, I see the possibility of automated upgrades of this schema as a big help for admins: even with very quick adoption in IMQ packages, you still need to upgrade the whole IntelMQ instead of just the schema. Yes, it has some cons, but I think the pros are stronger in this case.

re: air-gapped systems

This is why I'd suggest keeping a default, regularly updated (e.g. once a month) schema in the repo. Then your air-gapped system can still work after installation; without additional configuration it would just have a delay in adopting schema changes.

re: historical schema

This is a good point. But I'd just suggest keeping the history of schemas, in two ways:

  • Do not delete old schemas; instead, create a history folder and store all downloaded schemas there, named after the schema version (a sketch follows after this comment).
  • For cases where you don't have the schema: @elsif2, would it be possible for Shadowserver to publish a public catalog of all historical schemas?

As I understand it, the schema is versioned by time, so it should be easy to choose the right one if you need to reparse old reports, assuming that Shadowserver ensures that the creation date of a schema means the schema is in use starting that day or the next (please clarify this in the contract).
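
A minimal sketch of the suggested history folder, assuming the downloaded schema carries a version marker; the _meta/version key names and file paths used here are hypothetical.

import json
import shutil
from pathlib import Path

SCHEMA_FILE = Path('schema.json')
HISTORY_DIR = Path('schema-history')


def archive_current_schema() -> Path:
    # Keep a copy of the active schema, named after its version marker,
    # so old reports can later be re-parsed with the mapping that was valid at the time.
    HISTORY_DIR.mkdir(exist_ok=True)
    with SCHEMA_FILE.open(encoding='utf-8') as handle:
        schema = json.load(handle)
    version = schema.get('_meta', {}).get('version', 'unknown')  # assumed key names
    target = HISTORY_DIR / 'schema-{}.json'.format(version)
    if not target.exists():
        shutil.copy2(SCHEMA_FILE, target)
    return target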

Resolved review thread: CHANGELOG.md

__config.feedname_mapping.clear()
__config.filename_mapping.clear()
for schema_file in [__config.schema_file, __config.schema_base]:
Contributor:

Could you explain a little why schema_base, which looks to be a test schema, is loaded every time? If the only usage is for testing, I'd suggest loading it just during tests by specifying the bot's parameters.

If it has more usages, I'd suggest loading the base first. Otherwise, the base can override the updated schema.

Collaborator (Author):

Added a test_mode to switch between schema files.

Resolved review threads: intelmq/bots/parsers/shadowserver/_config.py (three threads), intelmq/bots/parsers/shadowserver/parser.py
For environments that have internet connectivity the `update_schema.py` script should be called from a cron job to obtain the latest revision.
The parser will attempt to download a schema update on startup unless INTELMQ_SKIP_INTERNET is set.

For air-gapped systems automation will be required to download and copy the _schema.json_ file into this directory.
Contributor:

I'm not sure which is better (a starting schema vs. downloading when missing), but I can accept either of them.

@elsif2 (Collaborator, Author) commented Aug 24, 2023

The complete revision history of the schema files will be available at https://github.com/The-Shadowserver-Foundation/report_schema.

@elsif2 elsif2 force-pushed the shadowserver-dynamic-config branch from d435ea3 to 516d534 on November 4, 2023 00:29
@elsif2 elsif2 force-pushed the shadowserver-dynamic-config branch from 516d534 to 61c756d on November 4, 2023 00:44
@kamil-certat (Contributor)

Thank you!

@sebix sebix dismissed aaronkaplan’s stale review November 28, 2023 09:24

Already reviewed some parts. Major parts are reviewed by Kamil and Aaron

@aaronkaplan (Member)

Looks like my requests have been met.
Testing again now.

@sebix (Member) commented Nov 29, 2023

I wanted to cancel the pending review for myself requested by Aaron, but it looks like I instead canceled one of Aaron's reviews 🤔

Anyway, I won't review this PR again, as I already reviewed some parts of it (core, docs). The major and more relevant parts have already been reviewed by Kamil and Aaron. I don't want to duplicate their efforts. I shared my thoughts on the change in general already above (#2372 (comment))

@aaronkaplan aaronkaplan requested review from aaronkaplan and removed request for sebix November 29, 2023 19:37
@aaronkaplan (Member)

@sebix all good, I am still looking at it. It's a mega, mega, mega commit. We need to do smaller ones next time to make the review process faster. Most of us have other duties / daytime jobs, etc.

@kamil-certat (Contributor)

This branch is currently in production use on our instance; so far I do not see any issues.

@DigitalTrustCenter (Contributor)

We would really like to see this feature merged into IntelMQ. We know all too well the pain of manually merging _config.py whenever new reports are added. The changes made to host(name) on the 18th of November showed the problem of requiring a different config from one day to the next. The dynamic config would fix those kinds of issues as well.

We do however see a problem. We run a custom config with a couple of changes compared to the upstream config:

  • We changed the classification.identifier of scan_http_vulnerable from accessible-http to vulnerable-http. accessible-http is also used by scan_http, which is a completely different report.
  • event_honeypot_brute_force, event_honeypot_darknet, event_sinkhole_http and event_sinkhole map fields like tag to classification.identifier. We would like a constant value for classification.identifier, so we moved it to the constant_fields section with the values honeypot-brute-force, darknet, sinkhole-http and sinkhole.
  • sandbox_url and sandbox_conn (unnecessarily, in our opinion) mix source and destination fields. We changed destination.fqdn and destination.url to source.fqdn and source.url.

We would not be surprised if other organizations have their own changes that they made to the config.
A solution would be to allow us to configure a path to a JSON patch document (RFC 6902) in the config. This patch would be applied to the schema JSON after a new version is downloaded in the update process (a sketch of the idea follows below).
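
A hypothetical sketch of that local-patch step, using the third-party jsonpatch package. The file names, the idea of a configurable patch path, and the JSON pointer used in the example operation are assumptions, not part of this PR or of the published schema format.

import json

import jsonpatch  # third-party package: pip install jsonpatch

SCHEMA_FILE = 'schema.json'                   # the downloaded Shadowserver schema
LOCAL_PATCH_FILE = 'schema-local-patch.json'  # RFC 6902 operations maintained locally


def apply_local_patch():
    # Apply a locally maintained RFC 6902 patch to the freshly downloaded schema.
    with open(SCHEMA_FILE, encoding='utf-8') as handle:
        schema = json.load(handle)
    with open(LOCAL_PATCH_FILE, encoding='utf-8') as handle:
        operations = json.load(handle)
    patched = jsonpatch.JsonPatch(operations).apply(schema)
    with open(SCHEMA_FILE, 'w', encoding='utf-8') as handle:
        json.dump(patched, handle, indent=2)


# Example patch document for the first change mentioned above (the pointer assumes the
# schema nests constant fields per report, which may not match the real layout):
# [
#   {"op": "replace",
#    "path": "/scan_http_vulnerable/constant_fields/classification.identifier",
#    "value": "vulnerable-http"}
# ]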

@kamil-certat (Contributor) commented Dec 6, 2023

Hey, we indeed have a couple of custom changes (e.g. filters by tags, our own identifiers, etc.), but we use a sieve bot for that - and as long as the feed.code is stable, our bot handles that independently. I would recommend this approach instead of modifying `_config.py` or patching the schema.

@DigitalTrustCenter (Contributor)

Thanks for the pointer. We are now looking into improving our setup using the sieve bot. The only thing it cannot solve is the change of the fields for sandbox_url (you cannot add a new field with a value from another field). Maybe those fields should be changed in the upstream schema itself. However, that is not relevant for this PR.

@elsif2 (Collaborator, Author) commented Dec 12, 2023

The change occurred between 3.0.2 and 3.1.0. I will correct this in the schema.

3.0.2

sandbox_url = {
    'required_fields': [
        ('time.source', 'timestamp', add_UTC_to_timestamp),
        ('source.ip', 'ip'),
    ],
    'optional_fields': [
        ('source.asn', 'asn', invalidate_zero),
        ('source.geolocation.cc', 'geo'),
        ('malware.hash.md5', 'md5hash'),
        ('source.url', 'url'),
        ('user_agent', 'user_agent', validate_to_none),
        ('source.fqdn', 'host', validate_fqdn),
        ('extra.', 'method', validate_to_none),
    ],
    'constant_fields': {
        'classification.taxonomy': 'malicious-code',
        'classification.type': 'malware-distribution',
        'classification.identifier': 'sandbox-url',
    },
}

3.1.0

sandbox_url = {
    'required_fields': [
        ('time.source', 'timestamp', add_UTC_to_timestamp),
        ('source.ip', 'ip', validate_ip),
    ],
    'optional_fields': [
        ('destination.fqdn', 'host', validate_fqdn),
        ('extra.http_request_method', 'method', validate_to_none),
        ('source.asn', 'asn', invalidate_zero),
        ('source.geolocation.cc', 'geo'),
        ('malware.hash.md5', 'md5', validate_to_none),
        ('destination.url', 'url', convert_http_host_and_url, True),
        ('user_agent', 'user_agent', validate_to_none),
    ],
    'constant_fields': {
        'classification.taxonomy': 'malicious-code',
        'classification.type': 'malware-distribution',
        'classification.identifier': 'sandbox-url',
    },
}

sandbox_conn = {
    'required_fields': [
        ('time.source', 'timestamp', add_UTC_to_timestamp),
        ('source.ip', 'ip', validate_ip),
        ('source.port', 'port', convert_int),
    ],
    'optional_fields': [
        ('destination.fqdn', 'host', validate_fqdn),
        ('source.asn', 'asn', invalidate_zero),
        ('source.geolocation.cc', 'geo'),
        ('malware.hash.md5', 'md5', validate_to_none),
        ('protocol.transport', 'protocol'),
        ('extra.', 'bytes_in', validate_to_none),
        ('extra.', 'bytes_out', validate_to_none),
    ],
    'constant_fields': {
        'classification.taxonomy': 'malicious-code',
        'classification.type': 'malware-distribution',
        'classification.identifier': 'sandbox-conn',
    },
}

@sebix (Member) commented Dec 13, 2023

Thanks for the pointer. We are now looking into improving our setup using the sieve bot. The only thing it cannot solve is the change of the fields for sandbox_url (you cannot add a new field with a value from another field).

That is true. As a workaround, you could use the Jinja2 Template Expert or the Modify Expert.

@aaronkaplan (Member) left a comment

This has been possibly the longest-running PR in IntelMQ.
But it's also the most complex one I have ever seen.
I think @elsif2 managed to stun us all.
Checking and testing this rigorously is pretty much out of the question.
But what I tested and ran on my setup looks good.

Thanks a ton for this important PR and sorry that it took months. Godspeed, shadowserver!

Resolved review thread: docs/user/bots.md
@aaronkaplan (Member)

And all checks have passed :) merging in...

@aaronkaplan aaronkaplan merged commit 4743ba9 into develop Dec 18, 2023
31 checks passed
@sebix sebix deleted the shadowserver-dynamic-config branch December 19, 2023 07:01
Comment on lines +159 to +165
- `intelmq.bots.collectors.shadowserver.collector_reports_api`:
- The 'json' option is no longer supported as the 'csv' option provides better performance.

#### Parsers
- `intelmq.bots.parsers.shadowserver._config`:
- Reset detected `feedname` at shutdown to re-detect the feedname on reloads (PR#2361 by @elsif2, fixes #2360).
- `intelmq.bots.parsers.shadowserver._config`:
- Switch to dynamic configuration to decouple report schema changes from IntelMQ releases.
Member:

The changelog entry is in the wrong section (3.2.0 instead of 3.2.2).

Member:

Can be fixed with #2447.


5 participants